49 research outputs found

    A feature extraction technique using bi-gram probabilities of position specific scoring matrix for protein fold recognition

    Get PDF
    Discovering a three dimensional structure of a protein is a challenging task in biological science. Classifying a protein into one of its folds is an intermediate step for deciphering the three dimensional protein structure. The protein fold recognition can be done by developing feature extraction techniques to accurately extract all the relevant information from a protein sequence and then by employing a suitable classifier to label an unknown protein. Several feature extraction techniques have been developed in the past but with limited recognition accuracy only. In this work, we have developed a feature extraction technique which is based on bi-grams computed directly from Position Specific Scoring Matrices and demonstrated its effectiveness on a benchmark dataset. The proposed technique exhibits an absolute improvement of around 10% compared with existing feature extraction techniques

    Bigram - PGK: phosphoglycerylation prediction using the technique of bigram probabilities of position specific scoring matrix

    Get PDF
    Background: The biological process known as post-translational modification (PTM) is a condition whereby proteomes are modified that affects normal cell biology, and hence the pathogenesis. A number of PTMs have been discovered in the recent years and lysine phosphoglycerylation is one of the fairly recent developments. Even with a large number of proteins being sequenced in the post-genomic era, the identification of phosphoglycerylation remains a big challenge due to factors such as cost, time consumption and inefficiency involved in the experimental efforts. To overcome this issue, computational techniques have emerged to accurately identify phosphoglycerylated lysine residues. However, the computational techniques proposed so far hold limitations to correctly predict this covalent modification. Results: We propose a new predictor in this paper called Bigram-PGK which uses evolutionary information of amino acids to try and predict phosphoglycerylated sites. The benchmark dataset which contains experimentally labelled sites is employed for this purpose and profile bigram occurrences is calculated from position specific scoring matrices of amino acids in the protein sequences. The statistical measures of this work, such as sensitivity, specificity, precision, accuracy, Mathews correlation coefficient and area under ROC curve have been reported to be 0.9642, 0.8973, 0.8253, 0.9193, 0.8330, 0.9306, respectively. Conclusions: The proposed predictor, based on the feature of evolutionary information and support vector machine classifier, has shown great potential to effectively predict phosphoglycerylated and non-phosphoglycerylated lysine residues when compared against the existing predictors. The data and software of this work can be acquired from https://github.com/abelavit/Bigram-PGK

    Protein fold recognition using genetic algorithm optimized voting scheme and profile bigram

    Get PDF
    In biology, identifying the tertiary structure of a protein helps determine its functions. A step towards tertiary structure identification is predicting a protein’s fold. Computational methods have been applied to determine a protein’s fold by assembling information from its structural, physicochemical and/or evolutionary properties. It has been shown that evolutionary information helps improve prediction accuracy. In this study, a scheme is proposed that uses the genetic algorithm (GA) to optimize a weighted voting scheme to improve protein fold recognition. This scheme incorporates k-separated bigram transition probabilities for feature extraction, which are based on the Position Specific Scoring Matrix (PSSM). A set of SVM classifiers are used for initial classification, whereupon their predictions are consolidated using the optimized weighted voting scheme. This scheme has been demonstrated on the Ding and Dubchak (DD), Extended Ding and Dubchak (EDD) and Taguchi and Gromhia (TG) datasets benchmarked data sets

    Success: evolutionary and structural properties of amino acids prove effective for succinylation site prediction

    Get PDF
    Post-translational modification is considered an important biological mechanism with critical impact on the diversification of the proteome. Although a long list of such modifications has been studied, succinylation of lysine residues has recently attracted the interest of the scientific community. The experimental detection of succinylation sites is an expensive process, which consumes a lot of time and resources. Therefore, computational predictors of this covalent modification have emerged as a last resort to tackling lysine succinylation. In this paper, we propose a novel computational predictor called ‘Success’, which efficiently uses the structural and evolutionary information of amino acids for predicting succinylation sites. To do this, each lysine was described as a vector that combined the above information of surrounding amino acids. We then designed a support vector machine with a radial basis function kernel for discriminating between succinylated and non-succinylated residues. We finally compared the Success predictor with three state-of-the-art predictors in the literature. As a result, our proposed predictor showed a significant improvement over the compared predictors in statistical metrics, such as sensitivity (0.866), accuracy (0.838) and Matthews correlation coefficient (0.677) on a benchmark dataset. The proposed predictor effectively uses the structural and evolutionary information of the amino acids surrounding a lysine. The bigram feature extraction approach, while retaining the same number of features, facilitates a better description of lysines. A support vector machine with a radial basis function kernel was used to discriminate between modified and unmodified lysines. The aforementioned aspects make the Success predictor outperform three state-of-the-art predictors in succinylation detection

    SumSec: accurate prediction of Sumoylation sites using predicted secondary structure

    Get PDF
    Post Translational Modification (PTM) is defined as the modification of amino acids along the protein sequences after the translation process. These modifications significantly impact on the functioning of proteins. Therefore, having a comprehensive understanding of the underlying mechanism of PTMs turns out to be critical in studying the biological roles of proteins. Among a wide range of PTMs, sumoylation is one of the most important modifications due to its known cellular functions which include transcriptional regulation, protein stability, and protein subcellular localization. Despite its importance, determining sumoylation sites via experimental methods is time-consuming and costly. This has led to a great demand for the development of fast computational methods able to accurately determine sumoylation sites in proteins. In this study, we present a new machine learning-based method for predicting sumoylation sites called SumSec. To do this, we employed the predicted secondary structure of amino acids to extract two types of structural features from neighboring amino acids along the protein sequence which has never been used for this task. As a result, our proposed method is able to enhance the sumoylation site prediction task, outperforming previously proposed methods in the literature. SumSec demonstrated high sensitivity (0.91), accuracy (0.94) and MCC (0.88). The prediction accuracy achieved in this study is 21% better than those reported in previous studies. The script and extracted features are publicly available at: https://github.com/YosvanyLopez/SumSec

    PyFeat: a python - based effective feature generation tool for DNA, RNA, and protein sequences

    Get PDF
    Extracting useful feature set which contains significant discriminatory information is a critical step in effectively presenting sequence data to predict structural, functional, interaction and expression of proteins, DNAs, and RNAs. Also, being able to filter features with significant information and avoid sparsity in the extracted features require the employment of efficient feature selection techniques. Here we present PyFeat as a practical and easy to use toolkit implemented in Python for extracting various features from proteins, DNAs, and RNAs. To build PyFeat we mainly focused on extracting features that capture information about the interaction of neighboring residues to be able to provide more local information. We then employ AdaBoost technique to select features with maximum discriminatory information. In this way, we can significantly reduce the number of extracted features and enable PyFeat to represent the combination of effective features from large neighboring residues. As a result, PyFeat is able to extract features from 13 different techniques and represent context free combination of effective features. The source code for PyFeat standalone toolkit and employed benchmarks with a comprehensive user manual explaining its system and workflow in a step by step manner are publicly available

    HseSUMO: Sumoylation site prediction using half - sphere exposures of amino acids residues

    Get PDF
    Background Post-translational modifications are viewed as an important mechanism for controlling protein function and are believed to be involved in multiple important diseases. However, their profiling using laboratory-based techniques remain challenging. Therefore, making the development of accurate computational methods to predict post-translational modifications is particularly important for making progress in this area of research. Results This work explores the use of four half-sphere exposure-based features for computational prediction of sumoylation sites. Unlike most of the previously proposed approaches, which focused on patterns of amino acid co-occurrence, we were able to demonstrate that protein structural based features could be sufficiently informative to achieve good predictive performance. The evaluation of our method has demonstrated high sensitivity (0.9), accuracy (0.89) and Matthew’s correlation coefficient (0.78–0.79). We have compared these results to the recently released pSumo-CD method and were able to demonstrate better performance of our method on the same evaluation dataset. Conclusions The proposed predictor HseSUMO uses half-sphere exposures of amino acids to predict sumoylation sites. It has shown promising results on a benchmark dataset when compared with the state-of-the-art method
    corecore